Skip to content

Conversation

@tkatila
Copy link
Contributor

@tkatila tkatila commented Nov 19, 2025

  • Some cleanup (remove unused NFD consts)
  • Updates to levelzero components
  • More temperature limit options
  • Container size optimization for gpu_levelzero

Fixes #2166

@tkatila tkatila requested a review from eero-t November 19, 2025 11:42
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits / comments.

apt-get update && apt-get install --no-install-recommends -y ocl-icd-libopencl1 && \N
rm /runtime/level-zero-devel_*.deb && \N
cd /runtime && dpkg -i *.deb && rm -rf /runtime && \N
apt-get update && apt-get install --no-install-recommends -y ocl-icd-libopencl1 wget ca-certificates && \N
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the main size reduction from this duplication comes actually from removing the accidentally left (large) downloaded deb files, not from dropping l0-dev, wget, certs & their deps. Did you check that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mainly comes from this:
COPY --from=builder /runtime /runtime
The runtime deb packages are copied from the build to the final phase, and while they are removed after the install the copy creates a large unnecessary layer. This is evident if you open the container in dive.

Would be nice if one could install packages directly from the build phase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker does not support host volumes for builds (needs extension), but podman does. What if packages were on a host tmp volume (-v $(mktemp -d):/temporary:rw), I don't think those go to the final image?

@tkatila tkatila force-pushed the gpu-misc-updates branch 2 times, most recently from 426558c to f270956 Compare November 20, 2025 09:30
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Additional container improvements don't need to be part of this PR.

Signed-off-by: Tuomas Katila <[email protected]>
Comment on lines 407 to 409
globalTempLimit := float64(dp.options.globalTempLimit)
memoryTempLimit := float64(dp.options.memoryTempLimit)
gpuTempLimit := float64(dp.options.gpuTempLimit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see this follows the existing functionality but I wonder if we could hide this float stuff befind the levelzero API and deal with (u)ints here. or is the cgo api used elsewhere too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp is in Celsius, not in Kelvin, so (in some contrived corner-case) it could be also negative. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two options:

  1. Move to integer in the CGo portion. It would change the protobuf definition and would break compatibility between older and newer plugin&levelzero containers. Probably not a big thing, but something worth thinking.
  2. Move to integers in the levelzero-bridge golang code (inside the plugin).

I'll see how the changes would look like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking something like 2. but it's not a blocker for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change.

Use the existing "temp-limit" as the global limit, and introduce GPU
and memory thresholds.

Signed-off-by: Tuomas Katila <[email protected]>
Fix uninitialized variable that caused random behaviour.

Signed-off-by: Tuomas Katila <[email protected]>
By re-downloading the components, we save on the overall container size.
While the build time increases slightly, the container size drops by
around 100M (520->420).

Signed-off-by: Tuomas Katila <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Per-Temperature Health Checks in GPU Plugin

3 participants